A presentation accompanies this handout; use it to add notes if you need to.

Before we start, remember we need to do a couple of things…

1. Tell R where your data are

See the introduction handout if you can’t remember how.

Check it worked using:

list.files()
##  [1] "01_intro-to-R.html"              "01_intro-to-R.Rmd"              
##  [3] "02_dealing-with-data-dplyr.html" "02_dealing-with-data-dplyr.Rmd" 
##  [5] "04_intro-to-ggplot_files"        "04_intro-to-ggplot-slides.html" 
##  [7] "04_intro-to-ggplot-slides.Rmd"   "04_intro-to-ggplot.html"        
##  [9] "04_intro-to-ggplot.Rmd"          "basics_of_rstudio.pdf"          
## [11] "climate_anomalies_1880-2016.csv" "compensation.csv"               
## [13] "intro-outline.md"                "MOMv3.3.txt"                    
## [15] "README.md"

or look at the Files tab
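As a reminder, checking (and, if needed, changing) the working directory might look something like the sketch below. The folder path in the comment is a made-up placeholder, not a real location - replace it with wherever you saved the workshop files.

```r
# Where is R currently looking for files?
wd <- getwd()
print(wd)

# If that is not the right folder, point R at it.
# NOTE: "path/to/workshop/folder" is a hypothetical placeholder,
# so this line is left commented out.
# setwd("path/to/workshop/folder")

# Check whether the data file used in this handout is visible from here
has_mom <- file.exists("MOMv3.3.txt")
print(has_mom)
```

If the last line prints FALSE, R is not looking in the folder that holds the data yet.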

2. Load the packages needed

If you installed the entire “tidyverse”, you should already have the ggplot2 package. If not, you will need to install it using install.packages("ggplot2").
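One possible way to install the package only when it is missing is sketched below. It assumes you have an internet connection and a CRAN mirror set up, which RStudio normally handles for you.

```r
# Install ggplot2 only if it is not already available on this computer
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Then load it for this session
library(ggplot2)
```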

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(ggplot2)

3. Import the data

First we are going to work with the dataset MOMv3.3.txt, a global compilation of mammal body masses (MOM = Masses Of Mammals). The original source is Smith et al. 2003.

MOM <- read_tsv("MOMv3.3.txt", col_names = FALSE)
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_character(),
##   X5 = col_character(),
##   X6 = col_character(),
##   X7 = col_double(),
##   X8 = col_double(),
##   X9 = col_character()
## )
#Let's name the columns so they are easier to work with
names(MOM) <- c("continent", "status", "order", "family", "genus", "species", "log_mass_g", "avg_mass_g", "reference")

4. Look at the data

glimpse(MOM)
## Observations: 5,731
## Variables: 9
## $ continent  <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF...
## $ status     <chr> "extant", "extant", "extant", "extant", "extant", "...
## $ order      <chr> "Artiodactyla", "Artiodactyla", "Artiodactyla", "Ar...
## $ family     <chr> "Bovidae", "Bovidae", "Bovidae", "Bovidae", "Bovida...
## $ genus      <chr> "Addax", "Aepyceros", "Alcelaphus", "Ammodorcas", "...
## $ species    <chr> "nasomaculatus", "melampus", "buselaphus", "clarkei...
## $ log_mass_g <dbl> 4.85, 4.72, 5.23, 4.45, 4.68, 4.59, 4.53, 4.60, 5.9...
## $ avg_mass_g <dbl> 70000.3, 52500.1, 171001.5, 28049.8, 48000.0, 39049...
## $ reference  <chr> "60", "63, 70", "63, 70", "60", "75", "60", "1", "2...
range(MOM$avg_mass_g)
## [1]      -999 190000000
#oops, we have some missing values, denoted as -999. Let's replace those with NA...
library(naniar)
## Warning: package 'naniar' was built under R version 3.4.3
MOM <- MOM %>%
  replace_with_na(replace = list(avg_mass_g = -999, log_mass_g = -999.00))
range(MOM$avg_mass_g, na.rm=TRUE)
## [1] 1.8e+00 1.9e+08
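If you do not have naniar installed, the same replacement could probably be done with na_if() from dplyr. The sketch below uses a tiny made-up data frame (toy) rather than the real MOM data, so the values are for illustration only.

```r
library(dplyr)

# A small invented data frame that mimics the two mass columns,
# using -999 as the missing-value code
toy <- tibble(
  avg_mass_g = c(70000.3, -999, 52500.1),
  log_mass_g = c(4.85, -999, 4.72)
)

# na_if() turns every value equal to -999 into NA
toy_clean <- toy %>%
  mutate(
    avg_mass_g = na_if(avg_mass_g, -999),
    log_mass_g = na_if(log_mass_g, -999)
  )

# The range now ignores the missing-value code
range(toy_clean$avg_mass_g, na.rm = TRUE)
```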

What plots do I need?

Before we start to make plots, let’s think about what we’d like to do more carefully.

  1. What kind of data do we have? e.g. quantitative? qualitative?
  2. What is the main question we’d like to answer?
  3. What would be interesting to show about our data?

We can spend all day making graphs, but there are a few rules of thumb we can use to avoid wasting effort, and to avoid going down a rabbit hole that distracts us from our main purpose.

1. What’s your point?

Are you visualizing data for exploratory purposes (to understand it better) or for explanatory purposes (to make a specific point to someone else)? The goal is not always to display all of the data as neatly and completely as possible!

2. Choose the right plot

Think about what each kind of plot can tell you and how your data type constrains your options. Would quantitative or qualitative data be best shown on this plot? Would the plot be best to display discrete data (categories, integers) or continuous data? Does this plot show all the data, or does it break it into distinct groups?

  • histograms can show the distribution of 1 quantitative variable
  • bar charts or boxplots can compare quantities in 2 or more different categories
  • scatterplots show the relationship between two quantitative variables
  • line charts can track changes through time, or show the relationship of 2 or more variables
  • pie charts compare parts of a whole
  • facets can divide information into distinct groups
  • layers can add more information as color, shape, size etc.
  • advanced plotting commands may include: heatmaps, layered or stacked bar graphs, maps, animated plots, 3D plots, and more!

3. Less is more

Don’t overcomplicate and try to be as straightforward as possible. Especially for explanatory plots, you want to try to focus your reader’s attention.

4. Use color intentionally

Use color to emphasize certain parts of a graph, or to push other parts to the background. For example, color could differentiate groups in a graph, or help your most exciting results stand out!

5. Add a title and good axis labels to help read and understand your graph

6. Experiment, get feedback, and iterate

Can someone else understand it?

There are many other options and different opinions on making graphics. Let’s start with some of the most common ones, and make some plots!

How do we make a ggplot - step by step

  1. Look at your data and decide what kind of plot you want to make
  2. Decide what variables and observations you need to use to make the plot. Are these already existing rows and columns in your dataset?
  3. Set up the ggplot “base”
  4. Add layers to your graph to specify the way it looks
  5. Add text, labels, etc, to help interpret your graph

Let’s try!

It’s OK to be confused by ggplot when you are first starting. The syntax differs from base R and follows its own pattern. But it is worth the effort!

ggplot2 uses the “grammar of graphics”, an idea that making a visualization shouldn’t require memorizing the names of different kinds of plots in R (or any coding language), but rather having a conceptual understanding of the parts of your data that you want to show, and then putting together those pieces as “layers”. ggplot’s syntax forces us to think specifically about how the structure of our data, and our question, needs to fit into the structure and options of a graphic - cool!

ggplot uses a basic template, and beyond the initial set-up, you get to choose which pieces you want to use, and what you want to specify. Start simple, and build up to complexity.


Some terminology:

  • ggplot - the main function where you specify the dataset and variables to plot (your “canvas”)
  • geoms - geometric objects
      • geom_point(), geom_bar(), geom_density(), geom_line(), geom_area()
  • aes - aesthetics
      • shape, size, color, fill, linetype, transparency (alpha)
  • scales - define how your data will be plotted
      • continuous, discrete, log
  • + vs %>% - ggplot layers are combined with +, while dplyr steps are chained with %>%
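To illustrate the last item, here is a small sketch of how %>% and + can appear in the same command. It uses R's built-in mtcars dataset purely for illustration; that dataset is not part of this workshop's data.

```r
library(dplyr)
library(ggplot2)

# Data steps are chained with %>% ...
# ... but once we are inside ggplot(), layers are added with +
p <- mtcars %>%
  filter(cyl == 4) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()

p
```

Using %>% where a + belongs (or the other way round) is one of the most common ggplot errors, so it is worth checking first when a plot fails.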

All plots begin with the “canvas” :

ggplot(data, aes(x,y))

Inside ggplot, aes() is the “aesthetics” of the plot - in other words, which data do you want to use or “map” into your graphic. It always goes in the order x and y. If you want to make a distribution of only one variable, you would only specify x.

On its own, this draws only an empty canvas with no data on it, because it just tells R which data and variables the plot will use.

We need to use + to add other “layers” to the plot and tell R to show us the plot.

To tell it what kind of plot to make, you need to add a “geom_” layer. geom_type() tells ggplot what kind of “geometry” you want your graphic to have. Google is the best place to go for this, or the ggplot2 website. There is built-in help in R, but the website is better to use if you need clarification or ideas.
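A minimal sketch of the difference between the bare canvas and a canvas with a geom layer is shown here. It again borrows the built-in mtcars dataset as a stand-in, so the variable names wt and mpg are assumptions for illustration only.

```r
library(ggplot2)

# The canvas: data and aesthetics, but no layers yet
canvas <- ggplot(mtcars, aes(x = wt, y = mpg))
length(canvas$layers)   # number of layers on the bare canvas

# Adding a geom_ layer with + tells ggplot how to draw the data
scatter <- canvas + geom_point()
length(scatter$layers)  # number of layers after adding geom_point()

scatter
```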

ggplot uses a different syntax to plot than the base R plot functions, so it looks a little different and takes some time to get used to. Start simple, and build up to complexity!

ggplot() sets defaults for layers

  • Combine layers with ggplot() using +
  • You will need to have a + at the end of every row, except the last one!
  • Must have at least one layer to plot
  • geom_ specifies how you want to plot, and may include statistical transformations
  • Add additional layers, as necessary
  • Order matters

ggplot() arguments:

  • default dataset
  • set of mappings
  • ‘Aesthetics’ from variables aes()
  • ggplot(GarlicMustardData, aes(x=bio12, y=AvgAdultHeight))

Add components of figures with layers

  • geom_point()
  • Simple scatter plot showing annual precipitation and average adult height

ggplot(GarlicMustardData, aes(x=bio12, y=AvgAdultHeight)) + geom_point()

Let’s start with a single variable, and make a distribution - a histogram!

Histogram

# base R histogram
hist(MOM$log_mass_g)

# ggplot histogram
ggplot(MOM, aes(x = log_mass_g)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1372 rows containing non-finite values (stat_bin).

# or we could build it iteratively!
myplot <- ggplot(MOM, aes(x = log_mass_g))
myplot + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1372 rows containing non-finite values (stat_bin).

# specify binwidth
ggplot(MOM, aes(log_mass_g)) + 
  geom_histogram(binwidth = 0.5)
## Warning: Removed 1372 rows containing non-finite values (stat_bin).

# plot on log scale
ggplot(MOM, aes(avg_mass_g)) + 
  geom_histogram() + 
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1372 rows containing non-finite values (stat_bin).

# add better labels
ggplot(MOM, aes(log_mass_g)) + 
  geom_histogram(binwidth = 0.5) + 
  labs(x = "Log mass in grams", y = "count", title = "Masses of Mammals")
## Warning: Removed 1372 rows containing non-finite values (stat_bin).

# compare 2 or more categories
ggplot(MOM, aes(log_mass_g)) + 
  geom_histogram(binwidth = 0.5) + 
  labs(x = "Log mass in grams", y = "count", title = "Masses of Mammals") + 
  facet_wrap(~status)
## Warning: Removed 1372 rows containing non-finite values (stat_bin).

# add color
ggplot(MOM, aes(log_mass_g)) + 
  geom_histogram(binwidth = 0.5, aes(fill=status)) + 
  labs(x = "Log mass in grams", y = "count", title = "Masses of Mammals")
## Warning: Removed 1372 rows containing non-finite values (stat_bin).
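The repeated "Removed ... rows" warning appears because ggplot drops rows where the plotted variable is NA. If you would rather drop those rows yourself before plotting, one option is sketched below on a tiny invented data frame (toy), not the real MOM data.

```r
library(dplyr)
library(ggplot2)

# Invented values, including two missing ones
toy <- tibble(log_mass_g = c(4.85, NA, 4.72, 5.23, NA, 0.9))

# Keep only the rows where the mass is known
toy_complete <- toy %>%
  filter(!is.na(log_mass_g))

# No missing values are left for ggplot to remove
ggplot(toy_complete, aes(x = log_mass_g)) +
  geom_histogram(binwidth = 0.5)
```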

To make a different kind of plot, we change the geom type that we are using.

Boxplot

A boxplot summarizes the data with a box: the line inside the box marks the median, and the bottom and top edges of the box mark the 25th and 75th percentiles. In other words, the middle 50% of the data lies between those edges. The lines (whiskers) coming from the box extend to the most extreme values that lie within 1.5 times the inter-quartile range (the height of the box) of each edge. Values beyond the whiskers are drawn as individual points and treated as ‘outliers’. You will want your x variable to be categorical or discrete so that it is clear what the boxes represent, and your y variable to be quantitative.

# make boxplots by continent
ggplot(MOM, aes(x = continent, y = log_mass_g)) + 
  geom_boxplot()
## Warning: Removed 1372 rows containing non-finite values (stat_boxplot).
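To make the pieces of a boxplot concrete, the numbers behind one box can be worked out by hand. This sketch uses a short invented vector x (not the MOM data) and base R functions only.

```r
# Ten invented values, one of them much larger than the rest
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Box edges and median
q <- unname(quantile(x, c(0.25, 0.5, 0.75)))
box_bottom <- q[1]
box_middle <- q[2]
box_top    <- q[3]

# Inter-quartile range = height of the box
iqr <- IQR(x)

# Whiskers cannot extend beyond these limits
lower_limit <- box_bottom - 1.5 * iqr
upper_limit <- box_top + 1.5 * iqr

# Anything outside the limits is drawn as an individual point
outliers <- x[x < lower_limit | x > upper_limit]
outliers
```

Here only the value 30 falls outside the limits, so it is the only point a boxplot of x would flag as an outlier.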

Now you try your own! Make a boxplot of the quantitative variable by some other categorical variable, and try adding layers for labels, color, or to compare different categories (facet).

Barplot

To compare counts in a dataset by a particular category, use a bar plot. In the background, geom_bar() uses the transformation stat_count() to count the number of rows for each category.

ggplot(MOM, aes(x=order)) +
  geom_bar()

#rotate axis labels to make them easier to read
ggplot(MOM, aes(x=order)) +
  geom_bar() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

#add color
ggplot(MOM, aes(x=order)) +
  geom_bar(aes(fill=continent)) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
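To see what the counting step does, you could do the counting yourself with dplyr and then plot the result with geom_col(), which draws bars at the heights you give it. The data frame toy below is invented for illustration.

```r
library(dplyr)
library(ggplot2)

# A few invented rows, one per species
toy <- tibble(
  order = c("Rodentia", "Rodentia", "Carnivora",
            "Primates", "Rodentia", "Primates")
)

# Count the rows in each category ourselves
order_counts <- toy %>%
  count(order)
order_counts

# geom_col() plots the pre-computed counts;
# geom_bar() on the raw toy data should give the same bars
ggplot(order_counts, aes(x = order, y = n)) +
  geom_col()
```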

Scatterplot

And for two quantitative variables, or to assess a relationship, we’d typically want to see a scatterplot.

ggplot(MOM, aes(avg_mass_g, log_mass_g)) + 
  geom_point()
## Warning: Removed 1372 rows containing missing values (geom_point).

#change size
ggplot(MOM, aes(avg_mass_g, log_mass_g)) + 
  geom_point(aes(size=avg_mass_g))
## Warning: Removed 1372 rows containing missing values (geom_point).

#add transparency to overlapping points
ggplot(MOM, aes(avg_mass_g, log_mass_g)) + 
  geom_point(aes(size=avg_mass_g), alpha=0.5)
## Warning: Removed 1372 rows containing missing values (geom_point).

#add color and facets
ggplot(MOM, aes(avg_mass_g, log_mass_g)) + 
  geom_point(aes(size=avg_mass_g, col=order)) + 
  facet_wrap(~continent)
## Warning: Removed 1372 rows containing missing values (geom_point).

Adding a trendline

Sometimes, we can convey more information by also plotting a trendline. Let’s try this using the iris dataset, to see if petal length can be predicted by sepal width.

ggplot(iris, aes(Sepal.Width, Petal.Length)) + geom_point() + stat_smooth()
## `geom_smooth()` using method = 'loess'

It doesn’t look like much, but the data look somewhat clumped. Since we know there are multiple species in this dataset, let’s try making a separate trendline for each of the 3 species.

ggplot(iris, aes(Sepal.Width, Petal.Length, group=Species)) + geom_point() + stat_smooth()
## `geom_smooth()` using method = 'loess'

OK, so that’s a little more interesting!
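By default stat_smooth() picked a wiggly loess curve here. If you would prefer a straight-line fit per species, one option is to ask for a linear model with method = "lm", as in this sketch (se = FALSE simply hides the shaded confidence band).

```r
library(ggplot2)

# One straight trendline per species, colored by species
iris_lm <- ggplot(iris, aes(Sepal.Width, Petal.Length, col = Species)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

iris_lm
```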

Time series

What if we want to track something through time? Then we need a geom_line. Let’s try using data downloaded for global temperature anomalies 1880-2016. You can find this dataset in the day1 folder. Go ahead and load it into R now.

clim <- read_csv("climate_anomalies_1880-2016.csv")
## Parsed with column specification:
## cols(
##   Source = col_character(),
##   Date = col_date(format = ""),
##   Mean = col_double()
## )

This is a pretty large dataset, with anomalies estimated by two different sources: GISTEMP (Global Land-Ocean Temperature Index) and GCAG (Global Component of Climate at a Glance). So it might take a while for the plot to appear on your laptop. Or you could use your new dplyr skills to subset it to be a smaller version of the dataset (e.g. the past 50 years).

ggplot(clim, aes(Date, Mean, group=Source, col=Source)) + geom_line() + geom_hline(yintercept=0)
## Warning in as.POSIXlt.POSIXct(x): unknown timezone 'zone/tz/2018c.1.0/
## zoneinfo/America/New_York'

# or plot just the overall trend
ggplot(clim, aes(Date, Mean, group=Source, col=Source)) + stat_smooth() + geom_hline(yintercept=0, linetype="dashed")
## `geom_smooth()` using method = 'gam'
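The subsetting idea mentioned above might look something like the following. To keep the sketch self-contained it builds a small invented data frame (toy_clim) with the same three column names as clim; the values and the cut-off date are assumptions for illustration.

```r
library(dplyr)

# Invented stand-in for clim: one row per year, 1880 to 2016
toy_clim <- tibble(
  Source = "GCAG",
  Date   = seq(as.Date("1880-01-06"), as.Date("2016-01-06"), by = "year"),
  Mean   = 0
)

# Keep only the rows from 1966 onwards (roughly the past 50 years)
recent <- toy_clim %>%
  filter(Date >= as.Date("1966-01-01"))

nrow(toy_clim)
nrow(recent)
```

With the real data you would replace toy_clim with clim and then hand the smaller data frame to ggplot().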

Themes to control overall appearance of plots

ggplot2 has themes!

You can find a complete list of themes at http://docs.ggplot2.org/current/ggtheme.html. And some more info here.

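You can install additional themes using install.packages("ggthemes"). Several of the themes below, such as theme_excel() and theme_economist(), come from that package, while theme_minimal() and theme_bw() are built into ggplot2.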

library(ggthemes)

Let’s just see what this looks like on one of the plots we previously made.

iris_plot <- ggplot(iris, aes(Sepal.Width, Petal.Length, group=Species, col=Species)) + geom_point() + stat_smooth()

# excel 
iris_plot + theme_excel()
## `geom_smooth()` using method = 'loess'

# The Economist
iris_plot + theme_economist()
## `geom_smooth()` using method = 'loess'

# Fivethirtyeight
iris_plot + theme_fivethirtyeight()
## `geom_smooth()` using method = 'loess'

# Tufte
iris_plot + theme_tufte()
## `geom_smooth()` using method = 'loess'

# Stephen Few
iris_plot + theme_few()
## `geom_smooth()` using method = 'loess'

# minimal
iris_plot + theme_minimal()
## `geom_smooth()` using method = 'loess'

# black and white
iris_plot + theme_bw()
## `geom_smooth()` using method = 'loess'

# solarized
iris_plot + theme_solarized()
## `geom_smooth()` using method = 'loess'
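A complete theme can also be combined with individual tweaks through theme(). The sketch below uses only ggplot2 (no ggthemes needed); the particular choices, a bottom legend and a larger base font, are just examples.

```r
library(ggplot2)

base_plot <- ggplot(iris, aes(Sepal.Width, Petal.Length, col = Species)) +
  geom_point()

# Start from a built-in theme, then adjust single elements with theme()
custom_plot <- base_plot +
  theme_bw(base_size = 14) +
  theme(legend.position = "bottom")

custom_plot
```

The order matters here: a complete theme such as theme_bw() added after theme() would overwrite the tweak.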

How to save before you go

ggsave to the rescue! You can save your plot as a png, pdf, jpeg, and more (see the help file for a full description). If you don’t specify plot, it will default to the most recent (last) plot that you made. Unless otherwise specified, it will save into your current working directory.

ggsave(filename = "iris.png", plot=iris_plot, width=6, height=4)
## `geom_smooth()` using method = 'loess'
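ggsave() works out the file type from the file extension, so saving a pdf instead of a png only needs a different filename. This sketch writes to R's temporary folder so it does not clutter your working directory; the file name iris_example.pdf is arbitrary.

```r
library(ggplot2)

p <- ggplot(iris, aes(Sepal.Width, Petal.Length)) +
  geom_point()

# Build a path inside the session's temporary folder
out_file <- file.path(tempdir(), "iris_example.pdf")

# The .pdf extension tells ggsave which format to write
ggsave(filename = out_file, plot = p, width = 6, height = 4)

# Confirm the file was written
file.exists(out_file)
```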

Additional information

Geometric objects

  • geom_point()
  • geom_line()

Statistical visualization

  • geom_smooth()
  • geom_bar()
  • geom_histogram()
  • geom_boxplot()

Dataset and aesthetic adjustments

  • scale_x_continuous(), scale_y_continuous()
  • scale_color_manual(), scale_fill_manual()
  • lims()
  • labs()
  • guide_legend()
  • theme(), theme_bw(), theme_classic()

Grouping related data

  • [Same Plot] Assign grouping variables as default or layer aes() for: group, color, or shape
  • [Multiple Plots] use facet_wrap() or facet_grid()

Remember

The process of making graphics in ggplot is like painting. First you lay down the canvas using the ggplot function. Then you add different layers to that canvas by specifying how you want the data to look and what you want the labels to state using the geoms (for example geom_point or geom_histogram) or facets (facet_wrap) or other layers shown above.

The best way to build up to more complex plots is to start by adding just one layer at a time, run the code, and make sure it works the way you expected. If it looks good, try to add your next layer or your next grouping (maybe color or facet). Run again and check that it worked. Especially when you are first starting, do not try to write a long command with many layers all at once. Just try one step at a time, and slowly build.